Abstract

In this work, we propose a theory for information matching. It is motivated by the
observation that retrieval is about the relevance matching between two sets of properties
(features), namely, the information need representation and the information item
representation. However, many probabilistic retrieval models rely on fixing one
representation and optimizing the other (e.g. fixing the single information need and
tuning the document) but not both. Therefore, it is difficult to use the available related
information on both the document and the query at the same time in calculating the
probability of relevance. In this work, we address the problem by hypothesizing
relevance as a logical relationship between the two sets of properties; the relationship
is defined on two separate mappings between these properties. Using the hypothesis,
we develop a unified probabilistic relevance model which is capable of using all
the available information. We validate the proposed theory by formulating and
developing probabilistic relevance ranking functions for both ad-hoc text retrieval and
collaborative filtering. Our derivation in text retrieval illustrates the use of the theory
in the situation where no relevance information is available. In collaborative filtering,
we show that the resulting recommender model unifies the user and item information
into a relevance ranking function without applying any dimensionality reduction
techniques or computing explicit similarity between two different users (or items), in
contrast to the state-of-the-art recommender models.

* The theory and the mathematical modelling presented in this report have not been published elsewhere. However,
different applications of the theory are under review.
1 Introduction

Information Retrieval (IR) is about relevance matching between two sets of properties (features),
namely, the information need (query) representation and the information item (document)
representation. In the past, many authors have attempted to define the various aspects of relevance [11],
and many different models, both non-probabilistic and probabilistic, have been proposed to
capture the notion of relevance between them. Some of the influential probabilistic models include
the classical probabilistic model developed by Robertson and Sparck Jones [21], the Probabilistic
Indexing model of Maron and Kuhns [15], the language modeling approaches by Ponte and Croft
[18], and the risk minimization framework of Zhai and Lafferty [14]. The central problem in all
these probabilistic models is the estimation of the probability of relevance, either implicitly or
explicitly, between a given information item represented by a document and a need represented
by a user query.

We note that documents and queries (needs) are typically represented by sets of properties; we
may think of vocabulary terms, for example (other examples are discussed below). In general,
there are two different approaches which the models bring to the formulation of the probability
of relevance. In the first approach, the probability of relevance is defined by correlating each
document with the information need properties of the users who would judge it relevant, i.e.
conditioned on the given document. This approach is called the document-oriented view of the
probability of relevance [20] and includes Maron and Kuhns' Probabilistic Indexing and the
language models [15, 18]. In the second approach, the probability of relevance is defined by
correlating each user query with the information properties of those documents that the user would
judge relevant, i.e. conditioned on the given information need (query). This approach is called the
query-oriented view and is used in the Robertson-Sparck Jones model [21]. These two views rely
on fixing one variable and optimizing the other, e.g. fixing the information need and tuning the
document or the other way around, but not both [19]. In fact, none of the existing models can use
the available relevance information on both the document and the query in calculating the probability
of relevance.

Another important aspect of modern information retrieval modeling is to incorporate properties
other than vocabulary terms into the relevance ranking function when computing the probability
of relevance. For example, previous studies have shown that query-independent features, such as
PageRank, and query-independent document usage features, such as click-through rates
and visit frequencies, can be utilized while calculating relevance [4]. As summarized in [17, 26],
useful information includes query-side information such as the click-through stream consisting of
all the user queries that have a click on the given document, information from past and
associated queries [26], relevant queries for the given document, and information from the set of
relevant documents of the query. However, none of the current probabilistic retrieval models is
capable of using all the information that might be available. This may be one of the reasons why
learning-to-rank algorithms such as LambdaRank [3] perform better than traditional probabilistic
retrieval models such as BM25 [22]. In large-scale web search engines, it is becoming
increasingly common to see all available information about the query and/or document being used to
train a learning-to-rank model, which is then used to rank the documents based
on their relevance to the given query. However, in learning-to-rank models the results are optimized for
general users, and personalized ranking is difficult. So, for lack of a unified theory, there is no
integration of all the information (including the user's personal features) in a traditional
probabilistic retrieval ranking framework.

On the other hand, recommendation (collaborative filtering) systems have some similarities to, as
well as some significant differences from, information retrieval systems. In both types of systems,
we need to satisfy the requirements of a particular user by offering him/her particular items from
a collection. In the case of information retrieval, we usually start from features (often words), but
may also make use of user feedback (relevance feedback). In the case of recommendation, we
usually start from feedback (user ratings) but may also make use of features. The most common
approach to the task of recommendation relates strongly to information retrieval [28]. Given that
in many recommendation situations we lack features that could be used directly, it is common to
attempt to derive a set of hidden features which might explain the ratings that we observe, and use
them to predict new ratings, from either a probabilistic [10] or a non-probabilistic perspective [13].
These features are usually assumed to describe both users and items, so that both entities may
be embedded in the same space. This parallels the information retrieval situation, where users
(in this case user queries) and items both have words as features, and we consider both entities as
points in a space defined by words. The usual assumption in such recommendation systems is that
this space is of relatively low dimensionality; although this assumption is by no means universal
in information retrieval, it is well represented there in the form of topic models such as PLSI [9]
and LDA.

Thus, in this work, we present a new retrieval theory that can incorporate all the above types
of information into a single model (as well as personalize the ranking results). The
basic idea is that the information need and information item are described with their respective
properties, potentially from different sets. The matching for relevance then requires two
separate mappings between these properties: one from the need to the item properties to identify
which item properties are sought by each need, and one in the reverse direction to identify which
need properties are 'sought' by each item. The relevance of the information need and
information item can then be estimated based on a logical relationship of the mappings. The advantage
of the unified theory, developed based on this simple idea, is that it is capable of utilizing any
available information² on both the document and the query in determining the probability of
relevance. It is thus widely applicable to many information retrieval problems that require
matching between two sets of properties. We illustrate its potential and derive two practical algorithms
by looking into the ad-hoc text retrieval and collaborative filtering problems. On the one hand, in text
retrieval, we show that the theory can handle the situation when there is no relevance information
available and derive a practical document ranking function. The TREC evaluation shows that the
resulting ranking function outperforms some strong baselines. On the other hand, the application
of the theory to recommender systems results in a new model that computes the probability of
relevance between a user-item pair without applying any dimensionality reduction techniques or
computing any explicit similarity metric between the users or items, in contrast to many
state-of-the-art models, e.g. the Matrix Factorization and Dimension Reduction methods [13, 10] and the
neighborhood-based methods [27, 25]. Our experiments on movie rating data sets demonstrate
that it performs significantly better than other baselines for the item ranking task.

The remainder of the paper proceeds as follows. In Section 2, we present our unified retrieval 
theory, and in Section 3, show how to employ the theory to derive appropriate ranking functions 
for both the text retrieval and collaborative filtering tasks. We then report our experiments in 
Section 4, and finally conclude the paper in Section 5. 



2 This includes the information about other relevant documents to the given document and other relevant
queries to the given query.
2 Unified Retrieval Theory

2.1 An Example 

Let us first start with a simple example to demonstrate the idea and insight behind our unified 
retrieval theory. We envisage a collection of employers seeking candidates to fill their job vacan- 
cies, and candidates (job seekers) seeking to find suitable positions. In general, each vacancy is 
described using its own properties and each candidate is described using his or her own properties.
Some properties can describe only the job vacancy or only the candidate but not both, e.g. age,
vacancy position, salary, etc. A candidate with certain properties seeks a job with
certain desired properties such as salary, position, etc., and similarly, an employer seeks to fill 
a vacancy with a candidate with certain properties such as qualifications, experience, languages 
known, etc. A vacancy is filled only if the position has the properties sought by the candidate 
and the candidate has the properties sought by the employer for this position. From a system 
perspective, to find an ideal match, we have to know the properties of candidate and vacancy, and 
also the properties in the other that are sought by each. 

A similar explanation in document retrieval would be that an information need with certain properties
seeks an information item with certain properties and an information item with certain properties 
seeks to satisfy information needs with certain properties. For example, if a query comes with 
an identified geolocation, this may (depending on the rest of the query) seek a document or page 
with a nearby geolocation (where the meaning of 'nearby' also depends on the rest of the query). 
Similarly, a page describing a restaurant will probably be 'seeking' relatively local people. On 
the other hand, we might hypothesize that any query is likely to seek an authoritative document 
(as measured by, say, PageRank). 

The basic idea here is that the information need and information item are described with their 
respective properties, potentially from different sets (we could think of these as vocabularies, but 
in principle the vocabulary for need-description is different from that for item-description). The 
matching for relevance then requires two separate mappings between these vocabularies: one 
from the need properties to the item properties (identifying which item properties are sought by 
each need), and one in the reverse direction (identifying which need properties are 'sought' by 
each item). 

2.2 A New Hypothesis 

Based on the idea presented in the above example, we propose a new hypothesis for IR by making
the following assumptions: (1) Any information (need/item or document/query) can be described
using a set of properties (concepts or features); (2) The complete set of properties that describe
information needs may not be the same as those that describe information items; (3) An information
need seeks an information item with certain description properties and, similarly, an information
item seeks to satisfy an information need with certain description properties; (4) All we know
about an information need is encapsulated in its properties; therefore we will model the item
properties sought by this need as a function of the need's properties, and vice versa. We will also
make the simplifying assumptions (a) that all properties are binary, and (b) that the two functions
indicated in (4) are linear and are represented by matrices.

Now, we state a Hypothesis for Information Retrieval as:

"Any information need or information item can be described using a set of properties, called 
need and item properties respectively. The relevance between an information item-need pair is 
dependent only on the relationship between the need and item properties that describe them."

In order to formulate the hypothesis, let $N$ be the set of $k$ need properties that can describe
any information need, where $N = \{n_1, n_2, \ldots, n_k\}$. Thus, an information need, denoted $I_N$,
is described by a vector $F$ of $k$ dimensions, with assumed binary values. Similarly, let $T$ be
the set of $l$ information item properties, where $T = \{t_1, t_2, \ldots, t_l\}$. An information item $I_T$
is represented by an $l$-dimensional binary vector $E$. Let $Y$ be an $N \times T$ information need seek
matrix, representing the information item properties sought by an information need, given this
need's properties. Each row in $Y$ corresponds to an information need property ($n_i \in N$ where
$i \in \{1, \ldots, |N|\}$) and each column to an information item property ($t_j \in T$ where $j \in \{1, \ldots, |T|\}$).
As a simple case the values of the matrix can be binary, "1" if the information need property seeks
the information item property and "0" otherwise, i.e. for $n \in N$, $t \in T$, $Y[n, t] = 1$ if $n$ seeks
$t$, and "0" otherwise. Similarly, let $Z$ be a $T \times N$ information item seek matrix, representing the
information need properties 'sought' by an information item, given this item's properties. Each
row corresponds to an information item property and each column corresponds to an information
need property. In the simple binary case, for $n \in N$, $t \in T$, $Z[t, n] = 1$ if the information
item with property $t$ seeks to satisfy an information need with property $n$. Here, $Y$ and $Z$ are property
relationship matrices.

Having defined the two matrices and expressed the relevance hypothesis, we can now put forward 
another explanation of the matrices. Considering Y, insofar as it maps needs onto item properties, 
it implicitly identifies similar needs (which may not start with the same need properties, but may 
be mapped onto the same item properties). This function of Y would emerge in a relevance 
feedback environment, from different users identifying the same items as relevant to their needs. 
Similarly, the matrix Z will identify similar documents, by mapping them onto the same need 
properties. These characteristics of the matrices can only be expected to emerge in a relevance 
feedback environment; they will become very clear in the case of collaborative filtering below. 
Our ad-hoc retrieval experiments do not at this stage include relevance feedback. 

Relevance under the Hypothesis: In this paper, we focus on a simple logical model of relevance
(on the assumption of perfect knowledge of all properties and relationships), while bearing in
mind that the framework is a general one and other retrieval methods can be derived with different
assumptions about relevance. Specifically, the pair $I_N$, $I_T$ is assumed relevant if and only if:
(1) all the "item properties" sought by the need $I_N$ describe $I_T$; and (2) all the "need properties"
'sought' by the item $I_T$ describe $I_N$. Under the above hypothesis, we can replace the individual
$I_N$ in (1) by its properties, and infer the sought item properties by applying $Y$. Similarly, we
can replace the individual $I_T$ in (2) by its properties, and infer the 'sought' need properties by
applying $Z$. For the simple binary properties case, the relevance conditions can be expressed as
follows: (1) $\forall i, j$: if $n_i = 1$ and $Y[n_i, t_j] = 1$ then $t_j = 1$; (2) $\forall i, j$: if $t_j = 1$ and $Z[t_j, n_i] = 1$ then
$n_i = 1$.
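The two binary relevance conditions can be checked mechanically. The following is a minimal Python sketch rather than part of the original formulation: the function name `is_relevant`, the list-of-lists encoding of the two seek matrices, and the job-matching toy data are all illustrative assumptions.

```python
def is_relevant(F, E, Y, Z):
    """Logical relevance test for binary properties (illustrative sketch).

    F: binary need vector, length |N|
    E: binary item vector, length |T|
    Y: |N| x |T| need seek matrix; Y[i][j] == 1 if need property i seeks item property j
    Z: |T| x |N| item seek matrix; Z[j][i] == 1 if item property j seeks need property i
    """
    # Condition (1): every item property sought by an active need property
    # must describe the item.
    for i, f in enumerate(F):
        for j, e in enumerate(E):
            if f == 1 and Y[i][j] == 1 and e != 1:
                return False
    # Condition (2): every need property sought by an active item property
    # must describe the need.
    for j, e in enumerate(E):
        for i, f in enumerate(F):
            if e == 1 and Z[j][i] == 1 and f != 1:
                return False
    return True


# Hypothetical toy data (assumed for illustration only):
# need properties = [speaks_french, has_degree],
# item (vacancy) properties = [based_in_paris, senior_role].
Y = [[1, 0],   # a French-speaking candidate seeks a Paris-based vacancy
     [0, 0]]   # holding a degree seeks nothing in particular
Z = [[1, 0],   # a Paris-based vacancy seeks a French speaker
     [0, 1]]   # a senior role seeks a degree holder

candidate = [1, 0]      # speaks French, no degree
paris_junior = [1, 0]   # Paris-based, not senior
paris_senior = [1, 1]   # Paris-based and senior

match_junior = is_relevant(candidate, paris_junior, Y, Z)
match_senior = is_relevant(candidate, paris_senior, Y, Z)
```

In this toy setting the junior vacancy satisfies both conditions, whereas the senior vacancy fails condition (2) because it 'seeks' a degree that the candidate does not have.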

2.3 Probabilistic Retrieval Model 

In order to develop a retrieval model based on the above definition of relevance, we would like 
to define a complete set of need and item properties and determine their values for a given infor- 
mation need or item, and also define the exact relationship matrices Y, Z. In practice, it is not 
possible to do so. So, an obvious way to develop a model based on the hypothesis is by defin- 
ing a restricted set of properties and probabilistically modeling their values. We assume that we 
have defined N, T and derive a probabilistic relevance ranking function to find the probability of 
relevance between I N , I T by introducing the uncertainty into the possible F, E values for I N , I T . 
Similarly, we assume that there is an uncertainty over the exact Y, Z.
2.3.1 Relevance Ranking Function 

The objective of the ranking function is to rank a set of information items for a given information 
need based on their probability of relevance. From the hypothesis we know that the relevance 
between I N , I T can be computed by using E, F, Y, Z. So, in order to rank the items for a given 
need, we compute the probability of relevance between any I N , I T , as follows, 

$$P(R = 1 \mid I_N, I_T) = \sum_{\alpha} \sum_{\beta} \sum_{\gamma} \sum_{\delta} P(R = 1, E, F, Y, Z \mid I_N, I_T) \tag{1}$$

where $R = 1$ means relevant, and $\alpha$, $\beta$, $\gamma$, $\delta$ are all the possible binary vectors and matrices of $E$,
$F$ and $Y$, $Z$ respectively. From the hypothesis, $E$, $F$, $Y$, $Z$ are sufficient to determine the relevance
between $I_N$ and $I_T$. In addition, $E$ is dependent only on $I_T$, $F$ is dependent only on $I_N$, and
$Y$ and $Z$ are independent of $I_N$, $I_T$. By applying Bayesian transformations and independence
assumptions, we get

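$$P(R = 1 \mid I_N, I_T) = P(R = 1) \sum_{\alpha} \sum_{\beta} \sum_{\gamma} \sum_{\delta} \frac{P(E, F, Y, Z \mid R = 1)}{P(E)\, P(F)}\, P(E \mid I_T)\, P(F \mid I_N) \tag{2}$$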
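One way to fill in the intermediate steps between Eq. (1) and Eq. (2) is sketched below for a single summand. The sketch assumes, beyond what is stated in the text, that $E$, $F$, $Y$ and $Z$ are a priori mutually independent, so that $P(E, F, Y, Z) = P(E)\,P(F)\,P(Y)\,P(Z)$; this prior independence is an assumption of the sketch and may differ from the exact argument intended in the derivation.

```latex
\begin{align*}
P(R=1, E, F, Y, Z \mid I_N, I_T)
  &= P(R=1 \mid E, F, Y, Z)\, P(E \mid I_T)\, P(F \mid I_N)\, P(Y)\, P(Z) \\
  &= \frac{P(E, F, Y, Z \mid R=1)\, P(R=1)}{P(E, F, Y, Z)}\,
     P(E \mid I_T)\, P(F \mid I_N)\, P(Y)\, P(Z) \\
  &= P(R=1)\, \frac{P(E, F, Y, Z \mid R=1)}{P(E)\, P(F)}\,
     P(E \mid I_T)\, P(F \mid I_N).
\end{align*}
```

The first line uses the stated dependencies ($R$ depends only on $E, F, Y, Z$; $E$ only on $I_T$; $F$ only on $I_N$; $Y, Z$ on neither), the second applies Bayes' rule to $P(R=1 \mid E, F, Y, Z)$, and the third cancels $P(Y)\,P(Z)$ under the assumed prior independence. Summing over all values of $E, F, Y, Z$ then gives the summed form.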

Here, we assume that the description value of an item property for $I_T$ is independent of
the other properties and, similarly, that the description value of a need property for $I_N$ is independent
of the other properties. We make the further assumption that each entry in $Y$ is independent of the
other entries in $Y$, and similarly for the entries in $Z$.³ Based on these assumptions, we can write Eq. (2) as

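$$P(R = 1 \mid I_N, I_T) = P(R = 1) \sum_{\alpha} \sum_{\beta} \sum_{\gamma} \sum_{\delta} \prod_{l} \prod_{m} P(E_l, F_m, Y_{ml}, Z_{lm} \mid R = 1)\, \underbrace{\frac{P(E_l \mid I_T)}{P(E_l)}}_{\text{Single property score}}\, \frac{P(F_m \mid I_N)}{P(F_m)} \tag{3}$$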

where $l \in \{1, 2, \ldots, |T|\}$ and $m \in \{1, 2, \ldots, |N|\}$. Eq. (3) is the final probabilistic unified
relevance ranking function.

To explain the behavior of the ranking function above, let us consider that there are two properties
in set $T$, where $T = \{T_{t_1}, A\}$: $T_{t_1}$ is a property associated with a vocabulary term and $A$ is an
"authority" property that describes whether the information item is authoritative or not. For example, one
might use PageRank to indicate authority, with a threshold to define a binary property. Similarly,
consider one property in $N$, where $N = \{N_{t_1}\}$ and $N_{t_1}$ is a term description property. Now, let us
assume that the matrices $Y$ and $Z$ are as follows: $Y = [1 \;\; 1]$ and $Z = [1 \;\; -]^{\top}$. $Y[N_{t_1}, A] = 1$
means that any information need with property $N_{t_1}$ seeks an information item with the authority
property $A$. "$-$" indicates that the value could be "0" or "1", meaning that we assume that its
value does not affect the relevance with respect to this property relationship. Note that the
ranking function in Eq. (3) can use any information about the document (information item) or query
(need) by modeling it as properties, defining their relationships through the $Y$, $Z$ matrices and
estimating their values for the given information need/item. Thus all information about the individual
item, the individual need, and other relevant need-item pairs that share property values is included in
determining the relevance, which is essential for a unified model [20].

3 A 'need property' seeking an 'item property' is independent of other properties and vice-versa. 
[Figure 1: The behavior of the Unified Ranking function. Panel (a): Description probabilities (axis label: P(E=1|I_T)). Panel (b): Two properties (axis label: Term Property).]



Now, if we substitute these values of $Y$, $Z$ into Eq. (3) and assume that the only available
information about the information items is $I_T$, then the ranking function score depends on the "Single
property score" of both $T_{t_1}$ and $A$ in Eq. (3). To see how these scores affect the ranking score,
we show Figure 1(a). The two base axes are the numerator of the property score (the probability that
the property describes the given information item) and the denominator (the probability that the property
describes any information item in general). The vertical axis is the relevance score, taken as the logarithm of the
"Single property score" of Eq. (3). The maximum relevance score is achieved when $P(E_l = 1)$
is minimal and $P(E_l = 1 \mid I_T)$ is maximal. This implies that the property describes very
few information items (low $P(E_l = 1)$) but describes the particular information item $I_T$ well
($P(E_l = 1 \mid I_T)$ is maximal). This is what one would expect of a reasonable ranking function.
Figure 1(b) shows how the relevance score changes when there are two properties (authority and a
term property), where the X and Y axes describe each "Single property score" and the Z axis is the sum of
the logarithmic scores of the two properties. So, the overall relevance score depends on how important
the properties are in describing all the information items and how well each of them describes
the information item (or need). Note that the relevance component in Eq. (3) includes
the relevance information on both the information item and the need, which is essential to a unified
model. We will use this component in the following applications.
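The behavior described above can be reproduced numerically. The sketch below is an illustration only: the function names and the probability values are assumptions, and it computes just the logarithm of the "Single property score" (the probability that a property describes the given item, divided by the probability that it describes any item) and its sum over two properties.

```python
import math


def log_single_property_score(p_given_item, p_prior):
    """log( P(E_l = 1 | I_T) / P(E_l = 1) ) for one binary property."""
    if not (0.0 < p_given_item <= 1.0 and 0.0 < p_prior <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_given_item / p_prior)


def two_property_score(term, authority):
    """Sum of log scores for a term and an authority property.

    Each argument is a pair (P(E = 1 | I_T), P(E = 1)).
    """
    return log_single_property_score(*term) + log_single_property_score(*authority)


# Hypothetical values: a rare term that describes the item well scores higher
# than a common term that describes it equally well.
rare_term = log_single_property_score(0.9, 0.01)
common_term = log_single_property_score(0.9, 0.5)
combined = two_property_score((0.9, 0.01), (0.6, 0.2))
```

As expected from the discussion of the figure, the score grows as the prior probability of the property shrinks and as its probability for the particular item grows.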
3 Applications

3.1 Ad-Hoc Text Retrieval 

To develop a text retrieval ranking function using the above unified theory, first, we need to define 
the property sets N and T. N (query or need properties) could be a set of properties associated 
with vocabulary words, query specific properties such as geolocation, query length, etc. Similarly, 
T could be a set of vocabulary term properties, document specific properties such as PageRank, 
url depth, etc. 

In a traditional ad-hoc retrieval task, the only information available to the retrieval models is
vocabulary terms and their statistics in queries and documents. So, to derive a simple ad-hoc retrieval
ranking function, we define both $N$ and $T$ as a set of $k$ properties, each corresponding to a single
vocabulary term. We call these "term-description" properties and denote the set by
$\mathcal{E}$. Now, to define the matrices $Y$ and $Z$, we define a relationship between the information need
and item properties as follows: as $N = T$, an information need with a description property $t$
seeks an information item with the same $t$, and vice versa ($t \in \mathcal{E}$). Based on this assumption, the $Y$, $Z$
matrices are defined as $Y = Z = M_{|\mathcal{E}| \times |\mathcal{E}|}$, where $M_{ij} = 1$ if $i = j$ and "0" otherwise.

Following the above relation, the definition of relevance between an information item (document)
$d$ and an information need (query) $q$ under the hypothesis reduces to a simple relationship where
$d$ and $q$ are relevant if and only if $E = F$, i.e. the description value of every property
of $d$ must be the same as that of $q$. We refer to this relevance relationship as relevance under the
"Strict identity" relation. The reason for this reduction is that we do not need $Y$ and $Z$ for the
computation of relevance, as we know that the same properties should describe both document and
query if they are relevant, i.e. the description value of each term-description property must be the same
for both $d$ and $q$.

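Ranking Function: Now, as per the above definition of relevance, the probability of relevance
between $d$ and $q$, $P(R = 1 \mid d, q)$, can be computed as

$$P(R = 1 \mid d, q) \propto_R \sum_{\alpha} \sum_{\beta} \prod_{i} \frac{P(E_i, F_i \mid R = 1)}{P(E_i)\, P(F_i)}\, P(E_i \mid d)\, P(F_i \mid q) \tag{4}$$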

where $\propto_R$ denotes rank equivalence (the constant $P(R = 1)$ is ignored) and $i \in \{1, 2, \ldots, k\}$, with $k$ the
number of term-description properties. Eq. (4) is a unified relevance ranking function for ad-hoc retrieval
when the same set of properties can describe a document or query. Eq. (4) uses the information about the
description value of each property for the given document and query ($P(E_i \mid d)$, $P(F_i \mid q)$), its value in
the collection ($P(E_i)$, $P(F_i)$), and the joint probability of the property values that describe relevant
document-query pairs. If there is a new relevant pair, its information will be added in computing the relevance.

In traditional TREC collections there is very little text on the query side, so to implement the
ranking function we avoid the estimation of the property values for the given query by making the
following assumption. Query property description assumption: As we have very little information
(only two to three query terms) from which to infer the query description property values, and as each
query term is very important in finding the relevant documents, we assume that each property
description value corresponding to a query term is "1" for the given query and that of every other property
is "0", i.e. we know the binary vector $F$. Basically, this assumption is similar to an implicit
assumption that the query terms are elite to the query and other terms are non-elite, as in [24].
In what follows, we use the terms elite and non-elite as synonymous with 'has the property' and
'does not have the property' respectively, for either users or items.
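Now, let $f_{q_i}$ denote the value of $F_i$, i.e. $F_i = f_{q_i}$, where $f_{q_i} = 1$ if $F_i$ is elite for $q$ and
$f_{q_i} = 0$ otherwise. Based on our assumption, Eq. (4) can be written as

$$P(R = 1 \mid d, q) \propto_R \sum_{\alpha} \prod_{i=1}^{k} \frac{\overbrace{P(E_i, F_i = f_{q_i} \mid R = 1)}^{\text{part 1}}}{P(E_i)\, P(F_i = f_{q_i})}\, P(E_i \mid d)\, P(F_i = f_{q_i} \mid q) \tag{5}$$

By applying Bayes' rule to part 1 of Eq. (5) and factorizing, we get

$$P(R = 1 \mid d, q) \propto_R \sum_{\alpha} \left( \prod_{\forall i: F_i = 0} \frac{P(E_i \mid F_i = 0, R = 1)\, P(F_i = 0 \mid R = 1)\, P(E_i \mid d)\, P(F_i = 0 \mid q)}{P(E_i)\, P(F_i = 0)} \right) \left( \prod_{\forall i: F_i = 1} \frac{P(E_i \mid F_i = 1, R = 1)\, P(F_i = 1 \mid R = 1)\, P(E_i \mid d)\, P(F_i = 1 \mid q)}{P(E_i)\, P(F_i = 1)} \right) \tag{6}$$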

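From the Query property description assumption, we know the value of each element in $F$. So,
we have $P(F_i = 1 \mid q) = 1$ if the term associated with the $i$-th term-description property is present in query $q$,
and $P(F_i = 0 \mid q) = 1$ otherwise. By substituting these values, Eq. (6) becomes

$$P(R = 1 \mid d, q) \propto_R \sum_{\alpha} \left( \prod_{\forall i: F_i = 0} \frac{P(E_i \mid F_i = 0, R = 1)\, P(F_i = 0 \mid R = 1)\, P(E_i \mid d)}{P(E_i)\, P(F_i = 0)} \right) \left( \prod_{\forall i: F_i = 1} \frac{P(E_i \mid F_i = 1, R = 1)\, P(F_i = 1 \mid R = 1)\, P(E_i \mid d)}{P(E_i)\, P(F_i = 1)} \right) \tag{7}$$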



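As defined, a document and a query are relevant if and only if $F = E$. From this definition, if we
know that the description value of a property is "1" for the given query (i.e. $F_i = 1$), then the
probability that the same property has value "1" ($E_i = 1$) in the relevant set of documents is "1", i.e.
$P(E_i = 1 \mid R = 1, F_i = 1) = 1$, as they have the same value in the relevant set. Equally, the
same holds for the property value "0", where $P(E_i = 0 \mid R = 1, F_i = 0) = 1$. Note that, from
this assumption, the score of any vector $E$ in Eq. (7) is zero if $E_i = 0$ and $F_i = 1$, or $E_i = 1$
and $F_i = 0$, for at least one $i$. By substituting these values in Eq. (7), we get

$$P(R = 1 \mid d, q) \propto_R \left( \prod_{\forall i: F_i = 0} \frac{P(F_i = 0 \mid R = 1)\, P(E_i = 0 \mid d)}{P(E_i = 0)\, P(F_i = 0)} \right) \left( \prod_{\forall i: F_i = 1} \frac{P(F_i = 1 \mid R = 1)\, P(E_i = 1 \mid d)}{P(E_i = 1)\, P(F_i = 1)} \right) \tag{8}$$

where $P(F_i = 0 \mid R = 1)$, $P(F_i = 1)$, $P(F_i = 0)$ and $P(F_i = 1 \mid R = 1)$ in Eq. (8) can be removed,
as these terms do not affect the ranking order. We thus get

$$P(R = 1 \mid d, q) \propto_R \left( \prod_{\forall i: F_i = 0} \frac{P(E_i = 0 \mid d)}{P(E_i = 0)} \right) \left( \prod_{\forall i: F_i = 1} \frac{P(E_i = 1 \mid d)}{P(E_i = 1)} \right) \tag{9}$$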

Eq. (9) is a ranking function under the Strict identity relation with the Query property description
assumption. We ignore the terms with property value "0" in Eq. (9) by assuming that the
absence of a term represents an unknown property. Applying a logarithmic transform to the ranking
function results in the following ranking function:

$$P(R = 1 \mid d, q) \propto_R \sum_{\forall i: F_i = 1} \log \frac{P(E_i = 1 \mid d)}{P(E_i = 1)} \tag{10}$$
The above simplification is similar to ignoring the terms that are not present in the query in [21], [18],
[14]. Note that each property score in the ranking function in Eq. (10) has the same behavior as shown in
Figure 1(a), which is a desired characteristic of a relevance ranking function.



One of the interesting by-products of our model is that the formula in Eq. (10) provides
yet another theoretical justification of IDF (inverse document frequency) as a scoring function [12].
To see this, let us assume that the description value of a property is "1" for a document if
the term is present in the document and "0" otherwise. Then, the probability of the property value
being "1" in the collection is $P(E_i = 1) = n_{c_i} / N$, where $n_{c_i}$ is the number of documents in the
collection containing the term associated with the $i$-th term-description property and $N$ is the total number
of documents in the collection. From the above assumption, $P(E_i = 1 \mid d) = 1$ if $F_i = 1$ (term $i$ is
in the query). By substituting these in the ranking function in Eq. (10), we get

$$P(R = 1 \mid d, q) \propto_R \sum_{\forall i: F_i = 1} \log \frac{1}{P(E_i = 1)} = \sum_{\forall i: F_i = 1} \log \frac{N}{n_{c_i}} \tag{11}$$



Now, the ranking function in Eq. (11) is simply a function of the IDF values of the query terms.
Essentially, it implies that the IDF score function relies on the assumption that a term is elite if it
occurs in the document. This is different from the explanation provided by the Robertson-Sparck
Jones model, where an explicit assumption that the whole collection is a non-relevant set is needed
[21].
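The IDF form of the ranking function can be illustrated on a toy collection. The following Python sketch is an assumption-laden illustration, not an implementation taken from the experiments: the function name `idf_score`, the four-document collection, and the decision to skip query terms that do not occur in the document (treating them as unknown properties) are all choices made here for the example.

```python
import math


def idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Score a document by summing log(N / n_ci) over query terms it contains.

    Under the binary assumption, P(E_i = 1 | d) is 1 when the term occurs in d,
    so each matching query term contributes log(1 / P(E_i = 1)) = log(N / n_ci).
    Query terms absent from the document are skipped (treated as unknown).
    """
    score = 0.0
    for term in set(query_terms):
        n_ci = doc_freq.get(term, 0)
        if term in doc_terms and n_ci > 0:
            score += math.log(num_docs / n_ci)
    return score


# Hypothetical toy collection of four documents, each given as a set of terms.
docs = {
    "d1": {"unified", "retrieval", "theory"},
    "d2": {"retrieval", "model"},
    "d3": {"collaborative", "filtering"},
    "d4": {"retrieval", "filtering"},
}

# Document frequency n_ci of every term in the collection.
doc_freq = {}
for terms in docs.values():
    for term in terms:
        doc_freq[term] = doc_freq.get(term, 0) + 1

query = ["unified", "retrieval"]
scores = {name: idf_score(query, terms, doc_freq, len(docs)) for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the rare query term "unified" dominates the score, so the only document containing it is ranked first, while a document containing neither query term receives a score of zero.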

To implement and test the ranking function in Eq. (10), we need to estimate the probabilities
$P(E_i = 1 \mid d)$ and $P(E_i = 1)$ for each i-th property in $\mathcal{E}$. To estimate these probabilities, we
assume the following generative process by which an author (or a user) expresses information:
(1) First, the user or author chooses a set of elite properties that can describe every aspect of
the information they want to express. (2) Once the properties are chosen, an observable information
item or need is generated by a stochastic function of the chosen properties. The uncertainty about
the description of a property for the information item is injected during this generation process.
A document is generated from a set of term-properties, so the occurrence of a term in a document has a
stochastic element associated with the description of its corresponding term-description property.
Therefore, we compute the probability that the i-th term-description property takes the value "1" for the
given document d as $P(E_i = 1 \mid d) = P(E_i = 1 \mid tf_i)$, where $tf_i$ denotes the term frequency
associated with the i-th term-description property in document d. As we assume that the description
of a property for a document is binary, a property description is "1" for some documents in the
collection and "0" for others. Moreover, $tf_i$ follows one distribution in the set of documents
described by the property and another distribution in the set of documents not described by it.
Therefore, we can draw a probabilistic inference about the description of a term-description
property from its associated term's frequency in the document.

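By applying Bayes' rule to $P(E_i = 1 \mid tf_i)$, we get

$$P(E_i = 1 \mid tf_i) = \frac{P(tf_i \mid E_i = 1)\, P(E_i = 1)}{\sum_{E_i \in \{0,1\}} P(tf_i \mid E_i)\, P(E_i)} \tag{12}$$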

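For simplicity, we use query terms to represent the properties that describe the query ($F_i = 1$ when
$q_i = 1$). By substituting Eq. (12) back in Eq. (10), we get the following ranking function

$$P(R = 1 \mid d, q) \propto_R \sum_{\forall i: F_i = 1} \log \frac{P(tf_i \mid E_i = 1)}{\sum_{E_i \in \{0,1\}} P(tf_i \mid E_i)\, P(E_i)} \tag{13}$$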
In order to estimate the probabilities in Eq. (13), we assume that the collection of documents is
a two-component mixture for any given property. As $\mathcal{E}$ is a set of term-description properties,
we assume that the term frequency of the term associated with each property in $\mathcal{E}$ follows a
Poisson distribution in the set of documents that are described by the property,
$P(tf_i \mid E_i = 1) = e^{-\mu_i(1)} \mu_i(1)^{tf_i} / tf_i!$, and another Poisson distribution in the other set,
$P(tf_i \mid E_i = 0) = e^{-\mu_i(0)} \mu_i(0)^{tf_i} / tf_i!$, where $\mu_i(1)$ and $\mu_i(0)$ are the two Poisson means
(the $tf_i!$ factor is common to both components and cancels in the ranking function). The mixing
probability $P(E_i = 1) = p_i$ is an additional parameter. This is the classic 2-Poisson mixture
model [8, 23] with parameters $\mu_i(1), \mu_i(0), p_i$.
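As a rough illustration of how the 2-Poisson mixture turns a term frequency into a posterior probability of eliteness via Bayes' rule, consider the sketch below. The function names and the parameter values are hypothetical and chosen only to show the shape of the computation.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of observing count k for the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def elite_posterior(tf, mu_elite, mu_non_elite, p_elite):
    """P(E_i = 1 | tf_i) under a two-component Poisson mixture, via Bayes' rule."""
    elite = poisson_pmf(tf, mu_elite) * p_elite
    non_elite = poisson_pmf(tf, mu_non_elite) * (1.0 - p_elite)
    return elite / (elite + non_elite)

# Hypothetical parameters: elite mean 4.0, non-elite mean 0.2, mixing weight 0.1
for tf in (0, 1, 3, 6):
    posterior = elite_posterior(tf, mu_elite=4.0, mu_non_elite=0.2, p_elite=0.1)
    print(tf, round(posterior, 4))
```

With an elite mean larger than the non-elite mean, the posterior grows monotonically with the term frequency, which is the behaviour the inference step relies on.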

For inference in the above mixture model, we can use either a maximum likelihood (ML) approach
or a Bayesian framework coupled with a Markov Chain Monte Carlo (MCMC) technique. For
the experiments in the following section, we estimate the optimal parameter values of the mixture
by maximum likelihood estimation (using the Expectation Maximization (EM) algorithm)
as well as by Gibbs sampling for finite mixtures via MCMC [6]. By substituting the estimated
parameter values in Eq. (10), we get the final ranking function

$$P(R = 1 \mid d, q) \propto_R \sum_{\forall i: F_i = 1} \log \frac{e^{-\mu_i(1)} \mu_i(1)^{tf_i}}{p_i\, e^{-\mu_i(1)} \mu_i(1)^{tf_i} + (1 - p_i)\, e^{-\mu_i(0)} \mu_i(0)^{tf_i}} \tag{14}$$

The ranking function in Eq. (14) looks similar to the ranking functions in [23, 24] but is substantially
different; the apparent similarity arises only from the use of the two-Poisson distributional
assumption.
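A minimal sketch of how a ranking function of this two-Poisson form might be evaluated for one document is given below. It assumes that fitted parameters are available per term as a tuple (elite mean, non-elite mean, mixing probability) and that query terms without fitted parameters are skipped; the names `unified_score` and `params`, and all numeric values, are hypothetical.

```python
import math

def unified_score(doc_tf, query_terms, params):
    """Sum over query terms of log(A / (p*A + (1-p)*B)), where
    A = exp(-mu1) * mu1**tf and B = exp(-mu0) * mu0**tf."""
    score = 0.0
    for term in query_terms:
        if term not in params:
            continue  # assumption: terms without a fitted mixture are skipped
        mu1, mu0, p = params[term]
        tf = doc_tf.get(term, 0)
        # Work in log space; the tf! factor cancels in the ratio
        log_a = -mu1 + tf * math.log(mu1)
        log_b = -mu0 + tf * math.log(mu0)
        # Log-sum-exp keeps the mixture denominator numerically stable
        m = max(log_a, log_b)
        log_mix = m + math.log(p * math.exp(log_a - m) + (1.0 - p) * math.exp(log_b - m))
        score += log_a - log_mix
    return score

# Hypothetical per-term parameters: (mu1, mu0, p)
params = {"retrieval": (4.0, 0.2, 0.1), "model": (3.0, 0.1, 0.2)}
score_a = unified_score({"retrieval": 3, "model": 2}, ["retrieval", "model"], params)
score_b = unified_score({"retrieval": 0, "model": 0}, ["retrieval", "model"], params)
```

Each term's contribution equals the log of the posterior eliteness probability divided by the prior, so it is bounded above by log(1/p) for that term.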

3.2 Collaborative Filtering (CF)

The unified probabilistic model in Section 2 can be directly used to rank and recommend a set of
items for a given user once we define the properties that describe the user, the item and their
relationships (Y, Z). So, in this section, we derive a ranking function specific to collaborative
filtering (CF), when the only available information is the user-item rating matrix.

Before deriving a relevance ranking model for CF, we outline the elements of the model as follows:
(1) Each individual user is assumed to have preferences for certain kinds of
items, similar to our example employer who seeks a candidate with certain characteristics for a job
vacancy. As we initially have no external indication of what 'kinds' of items exist, this preference
function is an unknown over the entire item space. That is, each item has a preference value for
this user, not as an individual item but as a representative of 'items like this'. (2) In an exactly
dual form, each individual item is assumed to have appeal to different kinds of users. Each user
has an appeal value for this item, not as an individual user but as a representative of 'users like
this one'. (3) When an individual user sees an individual item, his/her reaction (rating) is assumed
to be a stochastic function of the combination of user-item preference and item-user appeal.

In this version of the model, in the absence of any other properties, the 'properties' of users are
associated with individual items, e.g. 'this is an example of the kind of item that I like'. The
function of the matrix Z is to map this back to users, in other words to identify other users who
like similar things. Thus in this case the function of Z identified at the end of Section 2.2 becomes
very clear. Similarly, the properties of items are associated with individual users, and the matrix
Y performs the dual mapping.

Relevance under the hypothesis: Based on our hypothesis, a user u and item i pair is relevant
if and only if the "kinds of users" to whom item i appeals prefer the "kinds of items" preferred
by u, and the "kinds of items" preferred by user u appeal to the "kinds of users" that item i
appeals to. This definition of relevance is the same as our general definition of relevance under
the hypothesis; the only difference is the terminology, with the need and item properties replaced
by "kinds of items" and "kinds of users" respectively.⁴

Now, we derive a ranking function for CF using Eq. (5). In CF, the only available information
about the users and items consists of the user-ids, the item-ids, and a set of ratings. Using
this information, we define $I_p$ as the set of properties with one property per item-id, i.e. each
item is a different kind of item. A given user has a preference for the properties in $I_p$, where
$I_p = \{i_{id_1}, i_{id_2}, \dots, i_{id_N}\}$. In other words, a user is described using the $I_p$ properties.
Similarly, we define $U_a$ as the set of properties with one property per user-id, i.e. there are M
different kinds of users. A given item has an appeal factor for each user, where
$U_a = \{u_{id_1}, u_{id_2}, \dots, u_{id_M}\}$, i.e. an item is described using $U_a$.⁵

Let Pf be the preference matrix (i.e. Y) representing the relationship between the "kinds of items"
(user properties) and the "kinds of users" (item properties) and, similarly, let A be the appeal matrix (Z).
Let $Pf_m$ be the preference vector of property $u_{id_m}$ over $I_p$, where $m \in \{1, \dots, M\}$ and $u_{id_m} \in U_a$,
i.e. the n-th entry in $Pf_m$ ($Pf_{mn}$) is a binary value equal to "1" if $u_{id_m}$ prefers $i_{id_n}$ and
"0" otherwise. Similarly, let $A_n$ be the appeal vector of property $i_{id_n}$ over $U_a$, i.e. the m-th entry in $A_n$
($A_{nm}$) is a binary value equal to "1" if $i_{id_n}$ appeals to users with property $u_{id_m}$. By substituting
the above values in Eq. (5) we get

$$P(R = 1 \mid u_{id}, i_{id}) = P(R = 1) \sum_{E} \sum_{F} \sum_{Pf} \sum_{A} \prod_{n} \prod_{m} \frac{P(E_n, F_m, Pf_{mn}, A_{nm} \mid R = 1)}{P(E_n)\, P(F_m)}\, P(E_n \mid u)\, P(F_m \mid i) \tag{15}$$

From the hypothesis, we know that if a user-item pair $(u_m, i_n)$ is relevant then $E_n = 1$, $F_m = 1$,
$Pf_{mn} = 1$ and $A_{nm} = 1$. In other words, a user with property $u_m$ prefers an item with property
$i_{id_n}$, the item appeals to a user with property $u_{id_m}$, the kind of user $u_{id_m}$ prefers the kind of
item $i_{id_n}$, and the kind of item $i_{id_n}$ appeals to the kind of user $u_{id_m}$. From the above assumption,
if a user-item pair u, i is relevant then $P(E_n = 1, F_m = 1, Pf_{mn} = 1, A_{nm} = 1 \mid R = 1) = 1$.⁶
By substituting these values in Eq. (15), we get

$$P(R = 1 \mid u, i) = P(R = 1) \prod_{\langle n_r, m_r \rangle} \frac{P(E_{n_r} = 1 \mid u)\, P(F_{m_r} = 1 \mid i)}{P(E_{n_r} = 1)\, P(F_{m_r} = 1)} \sum_{E} \sum_{F} \sum_{Pf} \sum_{A} \prod_{\langle n_{nr}, m_{nr} \rangle} \frac{P(E_{n_{nr}}, F_{m_{nr}}, Pf_{m_{nr} n_{nr}}, A_{n_{nr} m_{nr}} \mid R = 1)\, P(E_{n_{nr}} \mid u)\, P(F_{m_{nr}} \mid i)}{P(E_{n_{nr}})\, P(F_{m_{nr}})} \tag{16}$$



where $n_r, m_r$ are such that $\langle u_{m_r}, i_{n_r} \rangle \in UI_{rel}$, with $UI_{rel}$ the set of relevant $\langle u, i \rangle$ pairs.
Similarly, $n_{nr}, m_{nr}$ are such that $\langle u_{m_{nr}}, i_{n_{nr}} \rangle \notin UI_{rel}$. Now, we assume that we
have only a set of relevant user-item pairs; then, by approximating and removing the constant
$P(R = 1)$ from Eq. (16), we get

$$P(R = 1 \mid u, i) \propto_R \prod_{\langle n_r, m_r \rangle} \frac{P(E_{n_r} = 1 \mid u)\, P(F_{m_r} = 1 \mid i)}{P(E_{n_r} = 1)\, P(F_{m_r} = 1)} \tag{17}$$

4 i.e. the user is represented by the kinds of items he prefers and the item is represented by the kinds of users it appeals to.

5 $I_p$, $U_a$ are the same as N, T in the general model.

6 This assumption forces $P(E_n, F_m, Pf_{mn}, A_{nm} \mid R = 1) = 0$ if any of the values $E_n, F_m, Pf_{mn}, A_{nm}$ is zero for a relevant u, i pair.



To estimate preference and appeal in the CF model, we assume that an observed rating
of a user-item pair $\langle u, i \rangle$ has a stochastic element associated with the item's appeal to the
kind of user that u belongs to and the user's preference for the kind of item that i belongs to.

In order to estimate the preference distribution of an individual user over kinds of item, we further 
assume that this user's observed ratings are the result only of this user's preferences. Similarly, 
to estimate the appeal distribution of an individual item over kinds of user, we assume that the 
observed ratings on this item are the result only of this item's appeal. These two assumptions are 
clearly oversimplifications but more sophisticated models can be pursued in future work. 

Using these assumptions, we compute the probability that the kind of item appeals to the user
u as $P(E_{i_{id}} = 1 \mid u) = P(E_{i_{id}} = 1 \mid r)$, where r denotes the observed rating of user u on item i.
Similarly, we compute the probability that the kind of user $u_{id}$ prefers the item i as
$P(F_{u_{id}} = 1 \mid i) = P(F_{u_{id}} = 1 \mid r)$. We further assume that the ratings r given by a kind of user $u_{id}$
to a set of items follow one distribution over the kinds of items s/he prefers and another distribution
over the non-preferred kinds of items. Similarly, the ratings r received by a kind of item follow one
distribution over the kinds of users that the item appeals to and another distribution over the
kinds of users it does not appeal to. Therefore, we can draw a probabilistic
inference about the preference of a user from his associated ratings over a set of items. By
applying Bayes' rule to $P(E_{i_{id}} = 1 \mid r)$, we get

$$P(E_{i_{id}} = 1 \mid r) = \frac{P(r \mid E_{i_{id}} = 1)\, P(E_{i_{id}} = 1)}{\sum_{E_{i_{id}} \in \{0,1\}} P(r \mid E_{i_{id}})\, P(E_{i_{id}})} \tag{18}$$



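Similarly, we compute $P(F_{u_{id}} = 1 \mid r)$. By substituting the above values in Eq. (17), we get the
ranking function

$$P(R = 1 \mid u, i) \propto_R \prod_{\langle n_r, m_r \rangle} \frac{P(r_{u i_{n_r}} \mid E_{i_{n_r}} = 1)}{\sum_{E_{i_{n_r}} \in \{0,1\}} P(r_{u i_{n_r}} \mid E_{i_{n_r}})\, P(E_{i_{n_r}})} \cdot \frac{P(r_{i u_{m_r}} \mid F_{u_{m_r}} = 1)}{\sum_{F_{u_{m_r}} \in \{0,1\}} P(r_{i u_{m_r}} \mid F_{u_{m_r}})\, P(F_{u_{m_r}})} \tag{19}$$

where $\propto_R$ denotes rank equivalence, $r_{u i_{n_r}}$ is the observed rating of user u on the kind of item
$i_{n_r}$ and, similarly, $r_{i u_{m_r}}$ is the rating that item i received from the kind of user $u_{m_r}$.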



To compute the probabilities in Eq. (19), we can use a version of the 2-Poisson mixture used for
ad-hoc retrieval. We assume that the ratings an item receives from the kinds of users to which it
appeals follow a Poisson distribution, and that its ratings from users to whom it does not appeal
follow a different Poisson distribution. We make a similar assumption about the ratings of a user.
Thus we have $\mu_{i_{n_r}}(1)$ and $\mu_{i_{n_r}}(0)$, the two Poisson means of the ratings received by the kind of
item $i_{n_r}$, with mixing probability $P(E_{i_{id_n}}) = p_{i_{n_r}}$, and a two-Poisson mixture for the ratings
associated with the kind of user $u_{m_r}$, with parameters $\mu_{u_{m_r}}(1)$ and $\mu_{u_{m_r}}(0)$ and mixing
probability $p_{u_{m_r}}$.⁷

7 Although ratings on a scale [1-5] are not the same as term frequencies, the fact that they are small integers makes 
the 2-Poisson assumption work passably well. 
Figure 2: Relevance Propagation in the Unified Model. The solid lines indicate the preference of
U2 to the 'kind of items' and the dotted lines indicate the appeal of I2 to the 'kind of users'.



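By substituting the parameter values in Eq. (19) and applying the logarithm, we get the final ranking
function as

$$P(R = 1 \mid u, i) \propto_R \sum_{\langle n_r, m_r \rangle} \left[ \log \frac{A}{p_{i_{n_r}} A + (1 - p_{i_{n_r}}) B} + \log \frac{C}{p_{u_{m_r}} C + (1 - p_{u_{m_r}}) D} \right] \tag{20}$$

where

$$A = e^{-\mu_{i_{n_r}}(1)} \mu_{i_{n_r}}(1)^{r_{u i_{n_r}}}, \quad B = e^{-\mu_{i_{n_r}}(0)} \mu_{i_{n_r}}(0)^{r_{u i_{n_r}}}, \quad C = e^{-\mu_{u_{m_r}}(1)} \mu_{u_{m_r}}(1)^{r_{i u_{m_r}}}, \quad D = e^{-\mu_{u_{m_r}}(0)} \mu_{u_{m_r}}(0)^{r_{i u_{m_r}}}.$$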


Eq. (20) is the final collaborative filtering ranking function, which makes use of related user-item
pairs to perform the calculation. It is important to note that, unlike matrix factorization
and dimension reduction methods such as SVD [13] and topic models [10], to name just
a few, we do not need to set any specific number of hidden dimensions in which both the users and
items will be represented. In other words, the model does not involve a lower-dimensional
representation of features. Also, there is no need to explicitly compute the similarities between
users or items, which is the basis of the user-based approaches [27] and the item-based approaches.



Instead, our method explores implicit similarity by computing u's preference for a 'kind of
item' and i's appeal to a 'kind of user' in a relevant user-item pair, as shown in Fig. 2(b). By
combining the preference and appeal of a user-item pair, the relevance information of a relevant
user-item pair is propagated to the relevance of the u, i pair; this is illustrated in Fig.
2(a). That is, if u likes a different item which also appeals to another user who likes item i,
then these known relevant pairs will affect the probability of u, i being relevant. This differs
from the unified collaborative filtering model presented in [28], where an unknown rating is
estimated by explicit similarity measures from three sources: the user's own ratings for different
items (item-based), other users' ratings for the same item (user-based), and ratings from different
but similar users for other but similar items.
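The propagation described above can be sketched for the ranking function in Eq. (20) as follows. This is only an illustrative reading: it assumes ratings are stored in a dictionary keyed by (user, item), that per-item and per-user two-Poisson parameters (mean for the "1" component, mean for the "0" component, mixing probability) have already been estimated, and that a relevant pair contributes only when both linking ratings are observed. All names and values are hypothetical.

```python
import math

def log_ratio(r, mu1, mu0, p):
    """log(A / (p*A + (1-p)*B)) with A = exp(-mu1) * mu1**r and B = exp(-mu0) * mu0**r."""
    a = math.exp(-mu1) * mu1 ** r
    b = math.exp(-mu0) * mu0 ** r
    return math.log(a / (p * a + (1.0 - p) * b))

def cf_score(user, item, ratings, relevant_pairs, item_params, user_params):
    """Accumulate preference and appeal evidence over known relevant (user, item) pairs."""
    score = 0.0
    for rel_user, rel_item in relevant_pairs:
        r_user = ratings.get((user, rel_item))  # target user's rating on the relevant pair's item
        r_item = ratings.get((rel_user, item))  # target item's rating from the relevant pair's user
        if r_user is None or r_item is None:
            continue  # assumption: a pair with a missing linking rating contributes nothing
        score += log_ratio(r_user, *item_params[rel_item])
        score += log_ratio(r_item, *user_params[rel_user])
    return score

# Toy data: u2 and i2 form a known relevant pair; u1 rated i2 highly, u3 rated it low
ratings = {("u1", "i2"): 5, ("u2", "i1"): 4, ("u2", "i2"): 5, ("u3", "i2"): 1}
relevant_pairs = [("u2", "i2")]
item_params = {"i2": (4.5, 1.5, 0.5)}
user_params = {"u2": (4.5, 1.5, 0.5)}
score_u1 = cf_score("u1", "i1", ratings, relevant_pairs, item_params, user_params)
score_u3 = cf_score("u3", "i1", ratings, relevant_pairs, item_params, user_params)
```

In the toy data, u1 shares a liking for i2 with u2, who in turn rated i1 highly, so i1 scores higher for u1 than for u3. No user-user or item-item similarity is computed explicitly.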



4 Experiments 

In this section we present our results on Ad-Hoc retrieval only. 

4.1 Test Collections 



The objective of our experiments is to see how well the resulting ranking functions in Eq. (14) and
Eq. (20) perform in the text retrieval and collaborative filtering (CF) applications, respectively. For
the ad-hoc retrieval evaluation, we used five different TREC document collections, representing
small to medium sizes: 1) FBIS on Disk 5, 2) Financial Times (FT) on Disk 4, 3) Los Angeles
Times (LA) on Disk 5, 4) the TREC-7 and TREC-8 ad hoc retrieval document collection, Disks 4 &
5 minus Congressional Record, and 5) the WT10G collection. The topic sets used are: 1) topics
301-350, 2) topics 401-450, 3) topics 501-550 and 4) topics 301-350 and 601-700 minus 672.
We use the document collection followed by the TREC number as a label for the test collection,
e.g. FBIS-8 represents the test collection with the FBIS document collection and TREC-8 topics (i.e.
401-450). Similarly, the labels Robust and TREC-10 represent the TREC Disks 4 & 5 document collection
with Robust topics and the WT10G collection with topics 501-550, respectively. For each of these
collections, queries are formed from the title field only.

Table 1: Comparison of the Unified Model with other baseline models. A t-test at the 95% confidence
level is used, and the statistically significant results (with respect to the second best models) are
marked with †.

                        Model Name & Performance
Collection  Metric  BM25    LM-JM   Dirichlet-LM  UM (EM)  UM (Bayesian)
FT-8        MAP     0.323   0.317   0.325         0.347    0.347†
            MRR     0.649   0.590   0.664         0.711    0.724†
FBIS-8      MAP     0.326   0.306   0.325         0.315    0.334†
            MRR     0.598   0.496   0.598         0.560    0.614†
LA-8        MAP     0.254   0.232   0.256         0.260    0.276†
            MRR     0.565   0.402   0.545         0.583    0.594†
TREC-8      MAP     0.251   0.239   0.256         0.257    0.260
            MRR     0.644   0.476   0.638         0.654    0.670†
TREC-7      MAP     0.193   0.180   0.192         0.191    0.195
            MRR     0.652   0.551   0.650         0.630    0.667
Robust      MAP     0.242   0.185   0.245         0.245    0.248
            MRR     0.650   0.564   0.668         0.620    0.638
TREC-10     MAP     0.193   0.148   0.193         0.190    0.195
            MRR     0.596   0.451   0.588         0.60     0.611

We also initialized the mixture parameters using the collection statistics as follows. For the
EM algorithm, we initialized p as the proportion of the documents in which the term occurs. Thus
the initial rank function is equivalent to IDF weighting (see the discussion in Section 3). We
used a minuscule value to initialize $\mu(0)$, assuming that the average term frequency of a term
associated with the term-description elite property in a document approaches zero if it is
non-elite to the document. Similarly, $\mu(1)$ was initialized with the average number of times the term
appeared in documents of the collection in which its term frequency is more than one. For
Gibbs sampling, we chose the prior parameter values in a similar fashion.
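The EM estimation with this initialization can be sketched as below for a single term. This is a minimal textbook EM for a two-component Poisson mixture, not the authors' implementation: the function names, the tiny starting value `1e-3`, the numerical floor, the fallback used when no document has a term frequency above one, and the toy data are all assumptions.

```python
import math

def init_params(tfs, tiny=1e-3):
    """Initial (mu1, mu0, p) from collection statistics; assumes the term occurs at least once."""
    occurring = [tf for tf in tfs if tf > 0]
    # Documents with tf > 1; falling back to tf >= 1 is an assumption of this sketch
    repeated = [tf for tf in tfs if tf > 1] or occurring
    p = len(occurring) / len(tfs)
    mu1 = sum(repeated) / len(repeated)
    return mu1, tiny, p

def fit_two_poisson(tfs, n_iter=100, floor=1e-6):
    """EM for a two-component Poisson mixture over one term's frequencies."""
    mu1, mu0, p = init_params(tfs)
    n = len(tfs)
    for _ in range(n_iter):
        # E-step: responsibility of the elite component for each document
        resp = []
        for tf in tfs:
            a = p * math.exp(-mu1) * mu1 ** tf
            b = (1.0 - p) * math.exp(-mu0) * mu0 ** tf
            resp.append(a / (a + b))
        # M-step: re-estimate the mixing weight and the two means
        total = sum(resp)
        p = min(max(total / n, floor), 1.0 - floor)
        mu1 = max(sum(r * tf for r, tf in zip(resp, tfs)) / max(total, floor), floor)
        mu0 = max(sum((1.0 - r) * tf for r, tf in zip(resp, tfs)) / max(n - total, floor), floor)
    return mu1, mu0, p

# Toy term-frequency sample: mostly absent, a few single mentions, a few elite documents
tfs = [0] * 80 + [1] * 8 + [4, 5, 6, 5, 4, 6, 5, 7, 3, 5, 4, 6]
mu1, mu0, p = fit_two_poisson(tfs)
```

Starting with p at the document-frequency proportion and a near-zero non-elite mean makes the first iteration behave much like IDF weighting before the means adapt to the data.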

Performance: After learning the parameters from each document collection, we applied our
ranking function in Eq. (14) to each test collection and computed the performance metric scores.
Table 1 summarizes the results of the unified ranking function (UM), using EM and Bayesian
estimation, along with the results of the baselines. The label LM-JM corresponds to the language
modeling ranking method [18] with Jelinek-Mercer smoothing, whereas Dirichlet-LM corresponds
to the language model with a Dirichlet prior. From Table 1 we can see that our ranking
function outperforms the other models in most cases (some of the differences are statistically
significant). Because the rank function does not use any information other than the term statistics
in the document collection, we believe the improvement is due to the term-based parameter
estimation, similar to the per-term smoothing in the Poisson-based query-generation language models.
Moreover, the performance of our model on title queries is comparable to the improved results
reported in [16]. In summary, the ad-hoc retrieval experiments show that the unified retrieval theory
has great potential in text retrieval. A simple ranking function derived from our unified theory
demonstrates that it can handle the retrieval situation without relevance feedback.
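For readers unfamiliar with the two metrics reported in Table 1, the sketch below computes mean average precision (MAP) and mean reciprocal rank (MRR) from ranked lists using their standard definitions. The evaluation tooling actually used for the experiments is not stated in the text, so the function names and the toy runs here are purely illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values at the ranks where relevant documents appear."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """Inverse rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_over_topics(metric, runs):
    """Average a per-topic metric over (ranking, relevant set) pairs."""
    return sum(metric(ranking, relevant) for ranking, relevant in runs) / len(runs)

# Two toy topics: (system ranking, set of relevant document ids)
runs = [
    (["d3", "d1", "d2"], {"d1", "d2"}),
    (["d5", "d4"], {"d5"}),
]
map_score = mean_over_topics(average_precision, runs)
mrr_score = mean_over_topics(reciprocal_rank, runs)
```

MAP rewards placing all relevant documents early, while MRR depends only on the position of the first relevant document, which is why the two metrics can rank systems differently in Table 1.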

5 Conclusion 

We have presented a new unified theory for information retrieval. We considered retrieval as 
a matching problem between two sets of properties, one from information needs and one from 
information items. To estimate the probability of relevance between them, we argued that the 
retrieval system not only needs to identify which item properties are 'sought' by each need, but 
also to identify which need properties are 'sought' by each item. We validated the proposed 
theory by formulating and developing practical relevance ranking functions for both ad-hoc text 
retrieval and collaborative filtering. We evaluated the ad-hoc retrieval ranking function
on publicly available test collections (TREC collections for the ad-hoc retrieval task). Besides the
theoretical contribution, our experiments demonstrated its wide applicability.

There are fruitful avenues for future investigations into the proposed unified retrieval framework. 
For instance, we intend to extend and test the current text retrieval rank function and apply it to 
web search where relevance information is available (in the form of click-through data). It is of 
great interest to study the theory in other IR applications such as content filtering, multimedia 
retrieval, people matching and search, opinion retrieval, and advertising.